Abstract

Conditional Random Fields (CRFs) constitute a popular and efficient approach for supervised
sequence labelling. CRFs can cope with large description spaces and can integrate some
form of structural dependency between labels. In this contribution, we address the issue of
efficient feature selection for CRFs based on imposing sparsity through an $\ell_1$ penalty. We first
show how sparsity of the parameter set can be exploited to significantly speed up training and
labelling. We then introduce coordinate descent parameter update schemes for CRFs with $\ell_1$
regularization. We finally provide some empirical comparisons of the proposed approach with
state-of-the-art CRF training strategies. In particular, it is shown that the proposed approach
is able to take advantage of the sparsity to speed up processing and handle larger dimensional
models.

1 Introduction 

Conditional Random Fields (CRFs), originally introduced in [16], constitute a popular and effective
approach for supervised structure learning tasks involving the mapping between complex objects
such as strings and trees. An important property of CRFs is their ability to cope with large
and redundant feature sets and to integrate some form of structural dependency between output
labels. Directly modeling the conditional probability of the label sequence given the observation
stream makes it possible to explicitly integrate complex dependencies that cannot directly be
accounted for in generative models such as Hidden Markov Models (HMMs). Results presented in
Section 6 will illustrate this ability to use large sets of redundant and non-causal features.

Training a CRF amounts to solving a convex optimization problem: the maximization of the
penalized conditional log-likelihood function. For lack of an analytical solution, however, the CRF
training task requires numerical optimization, which means repeatedly performing inference over the
entire training set during the computation of the gradient of the objective function.

This is where the modeling of structure takes its toll: for general dependencies, exact inference
is intractable and approximations have to be considered. In the simpler case of linear-chain CRFs,
modeling the interaction between pairs of adjacent labels makes the complexity of inference grow
quadratically with the size of the label set: even in this restricted setting, training a CRF remains a
computational burden, especially when the number of output labels is large.

Introducing structure has another, less studied, impact on the number of potential features that
can be considered. It is possible, in a linear-chain CRF, to introduce features that simultaneously
test the values of adjacent labels and some property of the observation. In fact, these features often
contain valuable information [23]. However, their number scales quadratically with the number of
labels, yielding both a computational problem (feature functions have to be computed, parameter
vectors have to be stored in memory) and an estimation problem.

The estimation problem stems from the need to estimate large parameter vectors based on
sparse training data. Penalizing the objective function with the $\ell_2$ norm of the parameter vector
is an effective remedy to overfitting; yet, it does not decrease the number of feature computations
that are needed. In this paper, we consider the use of an alternative penalty function, the $\ell_1$ norm,
which yields much sparser parameter vectors [28]. As we will show, inducing a sparse vector not
only reduces the number of feature functions that need to be computed, but it can also reduce the
time needed to perform parameter estimation and decoding.

The main shortcoming of the $\ell_1$ regularizer is that the objective function is no longer
differentiable everywhere, challenging the use of gradient-based optimization algorithms. Proposals have
been made to overcome this difficulty: for instance, the orthant-wise limited-memory quasi-Newton
algorithm [1] uses the fact that the $\ell_1$ norm remains differentiable when restricted to regions in
which the sign of each coordinate is fixed (an "orthant"). Using this technique, [12] reports test
performance on par with that obtained with an $\ell_2$ penalty, albeit with more compact
models. Our first contribution is to show that even in this situation (equivalent test performance),
the $\ell_1$ regularization may be preferable as sparsity in the parameter set can be exploited to reduce
the computational cost associated with parameter training and label inference.

For parameter estimation, we consider an alternative optimization approach, which general- 
izes to CRFs the proposal of [10] (see also [9l Ej). In a nutshell, optimization is performed 
in a coordinate-wise fashion, based on an analytic solution to the unidimensional optimization 
problem. In order to tackle realistic problems, we propose an efficient blocking scheme in which 
the coordinate-wise updates are applied simultaneously to a properly selected group (or block) of 
parameters. Our main methodological contributions are thus twofold: (i) a fast implementation 
of the training and decoding algorithms that uses the sparsity of parameter vectors and (ii) a 
novel optimization algorithm for using the $\ell_1$ penalty with CRFs. These two ideas combined together
offer the opportunity of using very large "virtual" feature sets for which only a very small number
of features are effectively selected. As will be seen (in Section 6), this situation is frequent in
typical natural language processing applications, particularly when the number of possible labels
is large. Finally, the proposed algorithm has been implemented as C code and validated through
experiments on artificial and real-world data. In particular, we provide detailed comparisons, in
terms of numerical efficiency, with solutions traditionally used for $\ell_2$ and $\ell_1$ penalized training of
CRFs in publicly available software such as CRF++ [15], CRFsuite [23] and crfsgd [3].

The rest of this paper is organized as follows. In Section 2, we introduce our notations and
restate more precisely the issues we wish to address, based on the example of a simple natural
language processing task. Section 3 discusses the algorithmic gains that are achievable when
working with sparse parameter vectors. We then study, in Section 4, the training algorithm used
to achieve sparsity, which implements a coordinate-wise descent procedure. Section 5 discusses
our contributions with respect to related work. Finally, Section 6 presents our experimental
results, obtained on simulated data, a phonetization task, and a named entity recognition
problem.

^1 To be more precise, we consider in the following a mixed penalty which involves both $\ell_1$ and squared $\ell_2$ terms,
also called the elastic net penalty [35]. The sparsity of the solution is however controlled mostly by the amount of
$\ell_1$ regularization.

2 Conditional Random Fields and Sparsity



Conditional Random Fields [16, 27] are based on the following discriminative probabilistic model

$$p_\theta(y|x) = \frac{1}{Z_\theta(x)} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t, x_t) \right\} \tag{1}$$

where $x = (x_1, \dots, x_T)$ denotes the input sequence and $y = (y_1, \dots, y_T)$ is the output sequence,
also referred to as the sequence of labels. $\{f_k\}_{1 \le k \le K}$ is an arbitrary set of feature functions and
$\{\theta_k\}_{1 \le k \le K}$ are the associated real-valued parameter values^2. The CRF form considered in (1) is
sometimes referred to as linear-chain CRF, although we stress that it is more general, as $y_t$ and
$x_t$ could be composed not directly of the individual sequence tokens, but of sub-sequences (e.g.,
trigrams) or other localized characteristics. We will denote by $Y$ and $X$, respectively, the sets in which
$y_t$ and $x_t$ take their values. The normalization factor in (1) is defined by

$$Z_\theta(x) = \sum_{y \in Y^T} \exp\left\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t, x_t) \right\} \tag{2}$$



The most common choice of feature functions is to use binary tests such that $f_k(y_{t-1}, y_t, x_t)$ is
one only when the triplet $(y_{t-1}, y_t, x_t)$ is in a particular configuration. In this setting, the number
of parameters $K$ is equal to $|Y|^2 \times |X|_{\text{train}}$, where $|\cdot|$ denotes the cardinality and $|X|_{\text{train}}$ refers to the
number of configurations of $x_t$ observed in the training set. As discussed in Section 3 below, the
bottleneck when performing inference is the computation of the pairwise conditional probabilities
$P_\theta(y_{t-1} = y, y_t = y' | x)$, for $t = 1, \dots, T$ and $(y, y') \in Y^2$ for all training sequences, which involves
a number of operations that scales as $|Y|^2$ times the number of training tokens. Thus, even in
moderate size applications, the number of parameters can be very large and the price to pay for
the introduction of sequential dependencies in the model is rather high, explaining why it is hard
to train CRFs with dependencies between more than two adjacent labels.

To motivate our contribution, we consider below a moderate-size natural language processing
task, namely a word phonetization task based on the Nettalk dictionary [25], where $|Y|$ (the
number of phonemes) equals 53 and $|X|$ is 26 (one value for each English letter). For this task,
we use a CRF that involves two types of feature functions, which we refer to as, respectively,
unigram features, $\mu_{y,x}$, and bigram features, $\lambda_{y',y,x}$. These are such that

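$$\sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t, x_t) = \sum_{y \in Y, x \in X} \mu_{y,x} \mathbb{1}(y_t = y, x_t = x) + \sum_{(y',y) \in Y^2, x \in X} \lambda_{y',y,x} \mathbb{1}(y_{t-1} = y', y_t = y, x_t = x) \tag{3}$$

where $\mathbb{1}(\text{cond.})$ is equal to 1 when the condition is verified and to 0 otherwise.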
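To make the parameterization concrete, the following is a minimal sketch that evaluates the score of (3) and obtains the probabilities of (1) and the normalization factor of (2) by brute-force enumeration, which is only feasible for toy sizes. The array layout (`mu[y, x]`, `lam[y_prev, y, x]`), the function names, and the use of label index 0 as the initial label $y_0$ are illustrative assumptions, not the implementation used in the paper.

```python
import itertools
import numpy as np

def sequence_score(mu, lam, x, y, y0=0):
    """Sum over t of mu[y_t, x_t] + lam[y_{t-1}, y_t, x_t], as in (3)."""
    score, prev = 0.0, y0          # y0: assumed start label (hypothetical convention)
    for xt, yt in zip(x, y):
        score += mu[yt, xt] + lam[prev, yt, xt]
        prev = yt
    return score

def brute_force_crf(mu, lam, x, y0=0):
    """Return (Z, probs): normalization factor (2) and p(y|x) of (1) by enumeration."""
    n_labels = mu.shape[0]
    labellings = list(itertools.product(range(n_labels), repeat=len(x)))
    weights = np.array([np.exp(sequence_score(mu, lam, x, y, y0)) for y in labellings])
    Z = weights.sum()
    return Z, dict(zip(labellings, weights / Z))

rng = np.random.default_rng(0)
n_labels, n_symbols = 3, 4
mu = rng.normal(size=(n_labels, n_symbols))             # unigram parameters
lam = rng.normal(size=(n_labels, n_labels, n_symbols))  # bigram parameters
Z, probs = brute_force_crf(mu, lam, x=[0, 2, 1])
```

The enumeration visits $|Y|^T$ labellings, which is exactly the cost that the recursions of Section 3 avoid.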

The use of the sole unigram features $\{\mu_{y,x}\}_{(y,x) \in Y \times X}$ would result in a model equivalent to
a simple bag-of-tokens position-by-position logistic regression model. On the other hand, bigram
features $\{\lambda_{y',y,x}\}_{(y',y,x) \in Y^2 \times X}$ are helpful in modelling dependencies between successive labels. The
motivations for using simultaneously both types of feature functions and the details of this
experiment are discussed in Section 6. In the following, by analogy with the domain of constrained
optimization, we refer to the subset of feature functions whose multiplier is non-zero as the "active"
features.

^2 Note that various conventions are found in the literature regarding the treatment of the initial term (with index
$t = 1$) in (1). Many authors simply ignore the term corresponding to the initial position $t = 1$ for so-called (see
below) bigram features. In our implementation, $y_0$ refers to a particular (always observed) label that indicates the
beginning of the sequence. In effect, this adds a few parameters that are specific to this initial position. However,
as the impact on performance is usually negligible, we omit this specificity in the following for the sake of simplicity.
Figure 1: The norm of the parameters estimated with standard $\ell_2$-regularized maximum
likelihood for the Nettalk task. Above: $|\mu_{y,x}|$ for the 53 phonemes $y$ and 26 letters $x$. Below:
$|\lambda_{y',y,x}|$ for the 53 phonemes $y$ and 26 letters $x$.

Figure 1 shows the magnitude of the parameter vectors obtained with the $\ell_2$-regularized
maximum likelihood approach. Sparsity is especially striking in the case of the $\lambda_{y',y,x}$
parameters which are, by far, the most numerous ($53^2 \times 26$). Another observation is that this
sparsity pattern is quite correlated with the corresponding value of $|\mu_{y,x}|$: in other words, most
sequential dependencies $\lambda_{y',y,x}$ are only significant when the associated marginal factor $\mu_{y,x}$ is.
This suggests taking a closer look at the internal structure of the feature set.

From this picture, one would expect to attain the same classification accuracy with a much
reduced set of feature functions using an appropriate feature selection approach. These preliminary
considerations motivate some of the questions that we try to answer in this contribution: (1) Is it
possible to take advantage of the fact that a large proportion of the parameters are zero to speed up
computations? (2) How can we select features in a principled way during the parameter estimation
phase? (3) Can sparse solutions also result in competitive test accuracy?

3 Fast Computations in Sparse CRFs 

3.1 Computation of the Objective Function and its Gradient 

Given $N$ independent labelled sequences $\{x^{(i)}, y^{(i)}\}_{i=1}^{N}$, where both $x^{(i)}$ and $y^{(i)}$ contain $T^{(i)}$
symbols, conditional maximum likelihood estimation is based on the minimization, with respect
to $\theta$, of

$$l(\theta) = -\sum_{i=1}^{N} \log p_\theta(y^{(i)}|x^{(i)}) = \sum_{i=1}^{N} \left\{ \log Z_\theta(x^{(i)}) - \sum_{t=1}^{T^{(i)}} \sum_{k=1}^{K} \theta_k f_k(y^{(i)}_{t-1}, y^{(i)}_t, x^{(i)}_t) \right\} \tag{4}$$

The function $l(\theta)$ is recognized as the negated conditional log-likelihood of the observations and will
be referred to in the following as the logarithmic loss function. This term is usually complemented
with an additional regularization term so as to avoid overfitting (see Section 4.1 below). The gradient
of $l(\theta)$ is given by

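$$\frac{\partial l(\theta)}{\partial \theta_k} = \sum_{i=1}^{N} \sum_{t=1}^{T^{(i)}} \mathrm{E}_{p_\theta(y|x^{(i)})} f_k(y_{t-1}, y_t, x^{(i)}_t) - \sum_{i=1}^{N} \sum_{t=1}^{T^{(i)}} f_k(y^{(i)}_{t-1}, y^{(i)}_t, x^{(i)}_t) \tag{5}$$

where $\mathrm{E}_{p_\theta(y|x^{(i)})}$ denotes the conditional expectation given the observation sequence, i.e.

$$\mathrm{E}_{p_\theta(y|x^{(i)})} f_k(y_{t-1}, y_t, x^{(i)}_t) = \sum_{(y',y) \in Y^2} f_k(y', y, x^{(i)}_t) \, P_\theta(y_{t-1} = y', y_t = y | x^{(i)}) \tag{6}$$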

Although $l(\theta)$ is a smooth convex function, it has to be optimized numerically. The computation of
its gradient requires repeatedly computing the conditional expectation in (6) for all input sequences
$x^{(i)}$ and all positions $t$.
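As a sanity check on the gradient formula, the sketch below computes the negated log-likelihood of a single sequence and its gradient with respect to the bigram parameters as "expected feature counts minus observed feature counts", again by brute-force enumeration on a toy problem. The array layout and the start-label convention `y0=0` are assumptions made for the illustration; a practical implementation would use the forward-backward recursions instead of enumeration.

```python
import itertools
import numpy as np

def score(mu, lam, x, y, y0=0):
    """Unnormalized log-score of labelling y for observations x."""
    prev, s = y0, 0.0
    for xt, yt in zip(x, y):
        s += mu[yt, xt] + lam[prev, yt, xt]
        prev = yt
    return s

def loss_and_grad_lam(mu, lam, x, y_obs, y0=0):
    """Negated log-likelihood of one sequence and its gradient w.r.t. lam."""
    n = mu.shape[0]
    ys = list(itertools.product(range(n), repeat=len(x)))
    s = np.array([score(mu, lam, x, y, y0) for y in ys])
    log_Z = np.log(np.exp(s).sum())
    p = np.exp(s - log_Z)                      # p(y | x) for every labelling
    grad = np.zeros_like(lam)
    for y, py in zip(ys, p):                   # expected feature counts
        prev = y0
        for xt, yt in zip(x, y):
            grad[prev, yt, xt] += py
            prev = yt
    prev = y0
    for xt, yt in zip(x, y_obs):               # minus observed feature counts
        grad[prev, yt, xt] -= 1.0
        prev = yt
    return log_Z - score(mu, lam, x, y_obs, y0), grad

rng = np.random.default_rng(1)
mu = rng.normal(size=(3, 2))
lam = rng.normal(size=(3, 3, 2))
x, y_obs = [0, 1, 1, 0], [2, 0, 1, 1]
loss, grad = loss_and_grad_lam(mu, lam, x, y_obs)
```

Comparing one entry of `grad` with a finite-difference quotient of `loss` is a quick way to validate any faster implementation of the same quantities.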

3.2 Sparse Forward-Backward Algorithm

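The standard approach for computing the conditional probabilities in CRFs is inspired by the
forward-backward algorithm for hidden Markov models: in the case of the parameterization of (3),
the algorithm implies the computation of the forward

$$\begin{cases} \alpha_1(y) = \exp(\mu_{y,x_1} + \lambda_{y_0,y,x_1}) \\ \alpha_{t+1}(y) = \sum_{y'} \alpha_t(y') \exp(\mu_{y,x_{t+1}} + \lambda_{y',y,x_{t+1}}) \end{cases} \tag{7}$$

and the backward recursion

$$\begin{cases} \beta_T(y') = 1 \\ \beta_t(y') = \sum_{y} \beta_{t+1}(y) \exp(\mu_{y,x_{t+1}} + \lambda_{y',y,x_{t+1}}) \end{cases} \tag{8}$$

for all indices $1 \le t < T$ and all labels $y \in Y$. Then, $Z_\theta(x) = \sum_{y} \alpha_T(y)$ and the pairwise
probabilities $P_\theta(y_t = y', y_{t+1} = y | x)$ are given by

$$\alpha_t(y') \exp(\mu_{y,x_{t+1}} + \lambda_{y',y,x_{t+1}}) \beta_{t+1}(y) / Z_\theta(x)$$

These recursions require a number of operations that grows quadratically with $|Y|$.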
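The dense forward and backward recursions can be sketched in a few lines of NumPy. This is an unscaled illustration suitable only for short sequences; the array layout (`mu[y, x]`, `lam[y_prev, y, x]`), the 0-based time index, and the start label `y0=0` are assumptions of the sketch rather than details taken from the paper.

```python
import numpy as np

def forward_backward(mu, lam, x, y0=0):
    """Dense forward-backward; returns Z, alpha, beta and pairwise marginals.

    pair[t, yp, y] = P(y_t = yp, y_{t+1} = y | x) for t = 0..T-2 (0-based)."""
    T, n = len(x), mu.shape[0]
    # psi[t][yp, y] = exp(mu[y, x_t] + lam[yp, y, x_t])
    psi = [np.exp(mu[:, xt][None, :] + lam[:, :, xt]) for xt in x]
    alpha = np.zeros((T, n))
    beta = np.ones((T, n))                    # beta at the last position is 1
    alpha[0] = psi[0][y0]
    for t in range(1, T):
        alpha[t] = alpha[t - 1] @ psi[t]      # |Y|^2 multiplications per position
    for t in range(T - 2, -1, -1):
        beta[t] = psi[t + 1] @ beta[t + 1]
    Z = alpha[-1].sum()
    pair = np.array([alpha[t][:, None] * psi[t + 1] * beta[t + 1][None, :] / Z
                     for t in range(T - 1)])
    return Z, alpha, beta, pair

rng = np.random.default_rng(0)
mu = rng.normal(size=(3, 2))
lam = rng.normal(size=(3, 3, 2))
Z, alpha, beta, pair = forward_backward(mu, lam, x=[0, 1, 1, 0])
```

Each matrix-vector product costs $|Y|^2$ multiplications, which is the quadratic growth mentioned above.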

Let us now consider the case where the set of bigram features $\{\lambda_{y',y,x_{t+1}}\}_{(y',y) \in Y^2}$ is sparse with
only $r(x_{t+1}) \ll |Y|^2$ nonzero values and define the matrix

$$M_{t+1}(y', y) = \exp(\lambda_{y',y,x_{t+1}}) - 1$$

Observe that $M_{t+1}(y', y)$ also is a sparse matrix and that the forward and backward equations
may be rewritten as

$$\begin{aligned} \alpha_{t+1}(y) &= \exp(\mu_{y,x_{t+1}}) \Big\{ \sum_{y'} \alpha_t(y') + \sum_{y'} \alpha_t(y') M_{t+1}(y', y) \Big\} \\ \beta_t(y') &= \sum_{y} v_{t+1}(y) + \sum_{y} M_{t+1}(y', y) v_{t+1}(y) \end{aligned} \tag{9}$$
where $v_{t+1}(y) = \beta_{t+1}(y) \exp(\mu_{y,x_{t+1}})$. The resulting computational savings stem from the fact that
the vector matrix products in (9) now only involve the sparse matrix $M_{t+1}(y', y)$. This means that
they can be computed, using an appropriate sparse matrix implementation, with exactly $r(x_{t+1})$
multiplications instead of $|Y|^2$. If the set $\{\mu_{y,x_{t+1}}\}_{y \in Y}$ of unigram features is also sparse, one may
use a similar idea although the computation savings will in general be less significant.
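The following sketch illustrates one sparse forward step: the sum of the previous forward variables is shared by all labels, and only the nonzero bigram weights contribute a correction through $\exp(\lambda) - 1$. The representation of the nonzero entries as a list of `(y_prev, y, value)` triples is an assumption chosen for readability; any sparse matrix format would serve the same purpose.

```python
import numpy as np

def sparse_forward_step(alpha_t, mu_col, nonzeros):
    """One sparse forward step.

    alpha_t[yp] : forward variables at position t
    mu_col[y]   : unigram weights mu[y, x_{t+1}]
    nonzeros    : list of (yp, y, value) for the nonzero lam[yp, y, x_{t+1}]"""
    acc = np.full(alpha_t.shape, alpha_t.sum())    # sum over y' of alpha_t(y'), shared by all y
    for yp, y, value in nonzeros:                  # r(x_{t+1}) terms only
        acc[y] += alpha_t[yp] * np.expm1(value)    # M(yp, y) = exp(lam) - 1
    return np.exp(mu_col) * acc

rng = np.random.default_rng(2)
n = 5
alpha_t = rng.random(n)
mu_col = rng.normal(size=n)
lam_slice = np.zeros((n, n))
lam_slice[0, 3], lam_slice[4, 1] = 0.7, -1.2       # only two active bigram features
nonzeros = [(yp, y, lam_slice[yp, y]) for yp, y in zip(*np.nonzero(lam_slice))]

sparse = sparse_forward_step(alpha_t, mu_col, nonzeros)
dense = alpha_t @ np.exp(mu_col[None, :] + lam_slice)   # reference dense update
```

Here the dense update uses 25 products while the sparse one touches only the two active entries, and both return the same vector.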

Using the implementation outlined in (9), the complexity of the forward-backward procedure
for the sequence $x^{(i)}$ can be reduced from $T^{(i)} \times |Y|^2$ to the cumulated sizes of the feature sets
encountered at each position along the sequence. Thus, the complexity of the forward-backward
procedure is proportional to the average number of active features per position in the parameter
set rather than to the actual number of potentially active features. This observation suggests that
it might even be possible to use some longer term dependencies between labels, as long as only a
few of them are active simultaneously.

It should be stressed that both CRF++ [15] and crfsgd [3] use logarithmic computation in the
forward-backward recursions, that is, updating $\log \alpha_t(y)$ and $\log \beta_t(y)$ rather than $\alpha_t(y)$ and $\beta_t(y)$
in (7) and (8). The advantage of logarithmic computations is that numerical overflows and underflows
are avoided whatever the length $T^{(i)}$ of the sequence, whereas the linear form of (7) and (8) is
only suitable for sequences whose length is less than a few tens. On the other hand, logarithmic
computation is not the only way of avoiding numerical issues (the "scaling" solution traditionally
used for HMMs [5] applies as well here) and is very inefficient from an implementation point
of view due to the repeated calls to the exp function (see Section 6.2). This being said, when
logarithmic computations are used, (9) may be used in a similar fashion to reduce the complexity
of the logarithmic update when $M_{t+1}(y', y)$ is sufficiently sparse.
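The "scaling" alternative to logarithmic computation can be sketched as follows: the forward vector is renormalized at each position and the logarithms of the normalizing constants are accumulated. This is a generic illustration of the HMM-style scaling trick under an assumed array layout (`mu[y, x]`, `lam[y_prev, y, x]`, start label `y0=0`); it is not claimed to match the scaling scheme of any particular package.

```python
import numpy as np

def log_partition_scaled(mu, lam, x, y0=0):
    """log Z via the forward recursion, renormalizing alpha at each position."""
    alpha = np.exp(mu[:, x[0]] + lam[y0, :, x[0]])
    log_Z = 0.0
    for xt in x[1:]:
        c = alpha.sum()                  # scaling constant for this position
        log_Z += np.log(c)               # one log per position, not per label pair
        alpha = (alpha / c) @ np.exp(mu[:, xt][None, :] + lam[:, :, xt])
    return log_Z + np.log(alpha.sum())

rng = np.random.default_rng(3)
mu = rng.normal(size=(4, 2))
lam = rng.normal(size=(4, 4, 2))
x_long = rng.integers(0, 2, size=5000).tolist()
log_Z_long = log_partition_scaled(mu, lam, x_long)   # stays finite for long sequences
```

On short sequences the result coincides with the logarithm of the unscaled normalization factor, while on the 5000-token sequence the unscaled forward variables would leave the range of double precision.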

Note that although we focus in this paper on the complexity of the training phase, the above
idea may also be used to reduce the computational burden associated with Viterbi (or optimal
sequence-wise) decoding. Indeed, at position $t + 1$, the forward pass in the Viterbi recursion
amounts to computing

$$\varepsilon_{t+1}(y) = \max_{y' \in Y} \left\{ \varepsilon_t(y') + \lambda_{y',y,x_{t+1}} \right\} + \mu_{y,x_{t+1}} \tag{10}$$

where $\varepsilon_t(y)$ denotes the conditional log-likelihood of the optimal labelling of the first $t$ tokens
subject to the constraint that the last label is $y$ (omitting the constant $-\log Z_\theta(x)$, which is
common to all possible labellings). Assuming that $A(y, x_{t+1}) = \{y' \in Y : \lambda_{y',y,x_{t+1}} \neq 0\}$ is limited
to a few labels, it is possible to implement (10) as

$$\varepsilon_{t+1}(y) = \max\Big\{ \max_{y' \in \bar{A}(y, x_{t+1})} \varepsilon_t(y'), \; \max_{y' \in A(y, x_{t+1})} \big( \varepsilon_t(y') + \lambda_{y',y,x_{t+1}} \big) \Big\} + \mu_{y,x_{t+1}}$$

where $\bar{A}(y, x_{t+1})$ denotes the elements of $Y$ that are not in the current active set $A(y, x_{t+1})$ of
bigram features. Hence the number of required additions is now of the order of the number of
active features, $|A(y, x_{t+1})|$, rather than equal to the number of labels $|Y|$.
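A possible realization of one sparse Viterbi step is sketched below. The active sets are stored as one dictionary per target label, and the best predecessor outside the active set is found by scanning the labels in decreasing order of score; this sorting trick is an implementation choice of the sketch (an assumption), not something stated in the text.

```python
import numpy as np

def sparse_viterbi_step(eps_t, mu_col, active):
    """One Viterbi step using active sets of bigram features.

    eps_t[yp]  : best log-score of a prefix ending in label yp
    mu_col[y]  : unigram weights mu[y, x_{t+1}]
    active[y]  : dict {yp: lam[yp, y, x_{t+1}]} holding the nonzero bigram weights"""
    order = np.argsort(-eps_t).tolist()            # labels by decreasing score, computed once
    eps_next = np.empty_like(eps_t)
    for y, lam_y in enumerate(active):
        # best predecessor outside the active set: no addition needed
        best = next((eps_t[yp] for yp in order if yp not in lam_y), -np.inf)
        for yp, value in lam_y.items():            # |A(y, x_{t+1})| additions
            best = max(best, eps_t[yp] + value)
        eps_next[y] = best + mu_col[y]
    return eps_next

rng = np.random.default_rng(4)
n = 6
eps_t = rng.normal(size=n)
mu_col = rng.normal(size=n)
lam_slice = np.zeros((n, n))
lam_slice[2, 0], lam_slice[5, 0], lam_slice[1, 3] = 1.5, -0.4, 2.0
active = [{yp: lam_slice[yp, y] for yp in range(n) if lam_slice[yp, y] != 0.0}
          for y in range(n)]
eps_next = sparse_viterbi_step(eps_t, mu_col, active)
```

The scan over `order` stops after at most $|A(y, x_{t+1})| + 1$ labels, so the work per target label depends on the size of its active set rather than on $|Y|$.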

4 Parameter Estimation Using Blockwise Coordinate Descent 
4.1 Regularization 

The standard approach for parameter estimation in CRFs consists in minimizing the logarithmic
loss $l(\theta)$ defined by (4) with an additional squared $\ell_2$ penalty term $\frac{\rho_2}{2} \|\theta\|_2^2$, where $\rho_2$ is a
regularization parameter. The objective function is then a smooth convex function to be minimized
over an unconstrained parameter space. Hence, any numerical optimization strategy may be used
for this purpose and popular solutions include limited memory BFGS (L-BFGS) [18] or conjugate
gradient. Note, however, that it is important to avoid numerical optimizers that require the full
Hessian matrix (e.g., Newton's algorithm) or approximations of it, due to the size of the parameter
vector in usual applications of CRFs.

In the following, we consider the elastic net penalty [35], which combines $\ell_1$ and $\ell_2$ regularizers
and yields the objective function

$$l(\theta) + \rho_1 \|\theta\|_1 + \frac{\rho_2}{2} \|\theta\|_2^2 \tag{11}$$

where $\rho_1$ and $\rho_2$ are regularization parameters. The use of both types of penalty terms seems
preferable in log-linear conditional models, as it makes it possible both to control the number of
nonzero coefficients (through $\rho_1$) and to avoid the numerical problems that might occur in large
dimensional parameter settings if the magnitude of the $\theta_k$'s is not sufficiently constrained by the
penalty.

It may also be rewarding to look for some additional information in the hierarchical and group
structure of the data. An example is the group lasso estimator, introduced in [21] as an extension
of the lasso. The motivation for the group lasso is to select not individual variables, but whole
blocks of variables. Typically, the penalty takes the form of the sum of the $\ell_2$ norms of predefined
blocks of the parameter vector. This idea has been further extended in [31] under the name of
Composite Absolute Penalties (CAP) for dealing with more complex a priori parameter hierarchies
while still retaining an overall convex penalty term. Although our approach could be suitable for
these more complex choices of the penalty function, we restrict ourselves in the following to the
case of the elastic net penalty.

4.2 Coordinate Descent

The objective function in (11) is still convex but not differentiable everywhere due to the $\ell_1$ penalty
term. Although different algorithms have been proposed to optimize such a criterion, we believe
that the coordinate-wise approach of [11] has a strong potential for CRFs, as the update of the
parameter $\theta_k$ only involves carrying out the forward-backward recursions for those sequences that
contain symbols $x$ such that at least one of the values $\{f_k(y', y, x)\}_{(y,y') \in Y^2}$ is nonzero; the number
of such sequences is most often much smaller than the total number of training sequences. This
algorithm operates by first considering a local quadratic approximation of the objective function
around the current value $\bar\theta$:

$$l_{k,\bar\theta}(\theta_k) = C^{st} + \frac{\partial l(\bar\theta)}{\partial \theta_k} (\theta_k - \bar\theta_k) + \frac{1}{2} \frac{\partial^2 l(\bar\theta)}{\partial \theta_k^2} (\theta_k - \bar\theta_k)^2 + \rho_1 |\theta_k| + \frac{\rho_2}{2} \theta_k^2 \tag{12}$$

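Then, the minimizer of the approximation (12) is easily found to be

$$\theta_k = \frac{ s\left( \frac{\partial^2 l(\bar\theta)}{\partial \theta_k^2} \bar\theta_k - \frac{\partial l(\bar\theta)}{\partial \theta_k}, \; \rho_1 \right) }{ \frac{\partial^2 l(\bar\theta)}{\partial \theta_k^2} + \rho_2 } \tag{13}$$

where $s$ is the soft-thresholding function defined by

$$s(z, \rho) = \begin{cases} z - \rho & \text{if } z > \rho \\ z + \rho & \text{if } z < -\rho \\ 0 & \text{otherwise} \end{cases} \tag{14}$$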
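The coordinate update is simple enough to be written directly. The sketch below implements the soft-thresholding function and the update that sets $\theta_k$ to $s(h \bar\theta_k - g, \rho_1) / (h + \rho_2)$, where $g$ and $h$ stand for the first and second derivatives of the loss at the current point; the function and argument names are illustrative assumptions. A helper evaluating the one-dimensional quadratic model (up to its additive constant) is included so that the update can be checked numerically.

```python
def soft_threshold(z, rho):
    """Soft-thresholding function s(z, rho)."""
    if z > rho:
        return z - rho
    if z < -rho:
        return z + rho
    return 0.0

def coordinate_update(theta_bar_k, grad_k, hess_k, rho1, rho2):
    """Minimizer of the one-dimensional quadratic model with elastic net penalty."""
    return soft_threshold(hess_k * theta_bar_k - grad_k, rho1) / (hess_k + rho2)

def quadratic_model(theta_k, theta_bar_k, grad_k, hess_k, rho1, rho2):
    """Quadratic model around theta_bar_k, up to its additive constant."""
    d = theta_k - theta_bar_k
    return (grad_k * d + 0.5 * hess_k * d * d
            + rho1 * abs(theta_k) + 0.5 * rho2 * theta_k ** 2)

# A weak gradient is not enough to move the parameter away from zero ...
stays_zero = coordinate_update(theta_bar_k=0.0, grad_k=0.2, hess_k=2.0, rho1=0.3, rho2=0.1)
# ... whereas a parameter with enough support is shrunk but kept.
kept = coordinate_update(theta_bar_k=1.0, grad_k=-0.5, hess_k=2.0, rho1=0.3, rho2=0.1)
```

The first call illustrates how the $\ell_1$ term produces exact zeros: whenever $|h \bar\theta_k - g| \le \rho_1$ the parameter is set to zero, which is what makes the resulting parameter vector sparse.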
Interestingly, [9] originally proposed a similar idea but based on a different local approximation
of the behavior of the logarithmic loss. In [9], the local behavior of the function $l(\theta)$ is approximated
under a form that is equivalent to the first-order only and leads to a closed-form coordinate-wise
optimization formula as well. This approximation is however explicitly based on the fact that
each parameter $\theta_k$ is multiplied by a function that takes its values in $\{0, 1\}$. This property is not
verified for CRFs, since $\theta_k$ is multiplied by $\sum_{t=1}^{T} f_k(y_{t-1}, y_t, x_t)$, which can be more than 1 if the
corresponding feature is observed at several positions in the sequence.



4.3 Coordinate Descent for CRFs 

To apply the algorithm described above for CRFs, one needs to be able to compute the update (13)-(14), which
requires evaluating the first and second order derivatives of $l(\theta)$. While the first order derivative is
readily computable using the forward-backward recursions described in Section 3.2 and (5), the
exact computation of the second derivative is harder for CRFs. In fact, standard computations
show that the diagonal term of the Hessian is

$$\frac{\partial^2 l(\theta)}{\partial \theta_k^2} = \sum_{i=1}^{N} \left\{ \mathrm{E}_{p_\theta(y|x^{(i)})} \left( \sum_{t=1}^{T^{(i)}} f_k(y_{t-1}, y_t, x^{(i)}_t) \right)^2 - \left( \mathrm{E}_{p_\theta(y|x^{(i)})} \sum_{t=1}^{T^{(i)}} f_k(y_{t-1}, y_t, x^{(i)}_t) \right)^2 \right\} \tag{15}$$

The first term is problematic as it involves the conditional expectation of a square, which cannot be
computed only from the pairwise probabilities $P_\theta(y_{t-1} = y', y_t = y | x^{(i)})$ returned by the
forward-backward procedure. It can be shown (see Chapter 4 of [5] and [1]) that (15) can be computed
using auxiliary recursions related to the usual forward recursion with an overall complexity of order
$|Y|^2 \times T^{(i)}$ per sequence. Unfortunately, this recursion is specific to each index $k$ and cannot be
shared between parameters. As will be shown below, sharing (part of) the computations between
parameters is a desirable feature for handling non trivial CRFs; we thus propose to use instead the
approximation

$$\frac{\partial^2 l(\theta)}{\partial \theta_k^2} \approx \sum_{i=1}^{N} \sum_{t=1}^{T^{(i)}} \left\{ \mathrm{E}_{p_\theta(y|x^{(i)})} f_k(y_{t-1}, y_t, x^{(i)}_t)^2 - \left( \mathrm{E}_{p_\theta(y|x^{(i)})} f_k(y_{t-1}, y_t, x^{(i)}_t) \right)^2 \right\} \tag{16}$$

This approximation amounts to assuming that, given $x^{(i)}$, $f_k(y_{t-1}, y_t, x^{(i)}_t)$ and $f_k(y_{s-1}, y_s, x^{(i)}_s)$
are uncorrelated when $s \neq t$. Note that this approximation is exact when the feature $f_k$ is only
active at one position along the sequence. It is likely that the accuracy of this approximation is
reduced when $f_k$ is active twice, especially if the corresponding positions $s$ and $t$ are close.
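The effect of this approximation can be observed on a toy example. The sketch below computes, by brute-force enumeration for a single sequence and a single bigram feature, the exact diagonal Hessian term (the variance of the feature count) and the approximation that sums the per-position variances. The array layout, the start label `y0=0` and the function name are assumptions of the illustration.

```python
import itertools
import numpy as np

def hessian_diag_exact_and_approx(mu, lam, x, feature, y0=0):
    """Exact variance of the count of one bigram feature (a, b, c) and its
    approximation by the sum of per-position variances, for one sequence."""
    a, b, c = feature
    n, T = mu.shape[0], len(x)
    ys = list(itertools.product(range(n), repeat=T))
    logw = np.array([sum(mu[y[t], x[t]] + lam[(y[t - 1] if t else y0), y[t], x[t]]
                         for t in range(T)) for y in ys])
    p = np.exp(logw - logw.max())
    p /= p.sum()                                   # p(y | x) for every labelling
    # active[i, t] = 1 if the feature fires at position t of labelling i
    active = np.array([[float((y[t - 1] if t else y0) == a and y[t] == b and x[t] == c)
                        for t in range(T)] for y in ys])
    counts = active.sum(axis=1)
    exact = p @ counts ** 2 - (p @ counts) ** 2
    p_t = p @ active                               # P(feature fires at t | x)
    approx = np.sum(p_t - p_t ** 2)                # binary feature: E[f^2] = E[f]
    return exact, approx

rng = np.random.default_rng(5)
mu = rng.normal(size=(2, 2))
lam = rng.normal(size=(2, 2, 2))
once = hessian_diag_exact_and_approx(mu, lam, x=[0, 1, 0], feature=(0, 1, 1))
twice = hessian_diag_exact_and_approx(mu, lam, x=[1, 1, 0], feature=(0, 1, 1))
```

In the first call the symbol tested by the feature occurs once and both values coincide. In the second call the feature can fire at two adjacent positions that exclude each other, so the two indicators are negatively correlated and the approximation overestimates the exact term.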

The proposed coordinate descent algorithm applied to CRFs is summarized as Algorithm 1.
A potential issue with this algorithm is the fact that, in contrast to the logistic regression case
considered in [11], we are using an approximation to $\partial^2 l(\theta) / \partial \theta_k^2$, which could have a
detrimental effect on the convergence of the coordinate descent algorithm. An important observation is
that (13)-(14) used with an approximated second-order derivative still yield the correct stationary
points.

Algorithm 1 Coordinate Descent for CRF

while convergence criterion is not met do
    for k = 1 : K do
        for sequences for which $f_k$ is active do
            Perform sparse forward-backward.
        end for
        Compute $\partial l(\theta) / \partial \theta_k$ and $\partial^2 l(\theta) / \partial \theta_k^2$ from (5) and (16).
        Update $\theta_k$ according to (13)-(14).
    end for
end while



To see why this is true, assume that $\bar\theta$ is such that (13)-(14) leave $\theta_k$ unchanged (i.e., $\theta_k = \bar\theta_k$).
If $\bar\theta_k = 0$, this can happen only if $|\partial l(\bar\theta) / \partial \theta_k| \le \rho_1$, which is indeed the first order optimality
condition in 0. Now assume that $\bar\theta_k > 0$; the fact that $\bar\theta_k$ is left unmodified by the recursion
implies that $\bar\theta_k \rho_2 + \partial l(\bar\theta) / \partial \theta_k + \rho_1 = 0$, which is also recognized as the first order optimality
condition (note that since $\bar\theta_k \neq 0$, the criterion is differentiable at this point). The symmetric
case, where $\bar\theta_k < 0$, is similar. Hence, the use of an approximated second-order derivative does
not prevent the algorithm from converging to the appropriate solution. A more subtle issue is the
question of stability: it is easily checked that if $\partial^2 l(\theta) / \partial \theta_k^2$ is smaller than it should be (remember
that it has to be positive as $l(\theta)$ is strictly convex), the algorithm can fail to converge even for
simple functions (e.g., if $l(\theta)$ is a quadratic function). An elaborate solution to this issue would
consist in performing a line search in the "direction":

$$\theta_k = \frac{ s\left( \alpha \frac{\partial^2 l(\bar\theta)}{\partial \theta_k^2} \bar\theta_k - \frac{\partial l(\bar\theta)}{\partial \theta_k}, \; \rho_1 \right) }{ \alpha \frac{\partial^2 l(\bar\theta)}{\partial \theta_k^2} + \rho_2 }$$

where $\alpha \ge 1$ is chosen as close as possible to 1 with the constraint that it indeed leads to
a decrease of the objective function (note that the step size affects only the second-order term in
order to preserve the convergence behavior). On the other hand, coordinate descent algorithms
are only viable if each individual update can be performed very quickly, which means that using
line search is not really an option. In our experiments, we found that using a fixed value of
$\alpha = 1$ was sufficient for Algorithm 1, probably due to the fact that the second-order derivative
approximation in (16) is usually quite good. For the blockwise approach described below, we had
to use somewhat larger values of $\alpha$ to ensure stability (typically, in the range 2-5 near convergence
and in the range 50-500 for the very few initial steps of the algorithm in cases where it is started
blindly from arbitrary parameter values). Alternative updates based on uniform upper-bounds of
the Hessian could also be derived in a fashion similar to the work reported in [14].
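The stability issue can be reproduced on the simplest possible case, a one-dimensional quadratic loss $l(\theta) = a \theta^2 / 2$ without penalty. In the sketch below, the second derivative supplied to the update is deliberately underestimated by a factor `hess_scale`, and the multiplier `alpha` inflates it again; all names and numerical values are illustrative assumptions.

```python
def soft_threshold(z, rho):
    """Soft-thresholding function s(z, rho)."""
    return z - rho if z > rho else z + rho if z < -rho else 0.0

def run_coordinate_descent(a, hess_scale, alpha, theta0=1.0,
                           rho1=0.0, rho2=0.0, n_iter=20):
    """Iterate the damped update on l(theta) = a * theta**2 / 2, using a
    second derivative that is wrong by the factor hess_scale."""
    theta = theta0
    for _ in range(n_iter):
        grad = a * theta                  # exact first derivative
        hess = alpha * hess_scale * a     # inflated approximate second derivative
        theta = soft_threshold(hess * theta - grad, rho1) / (hess + rho2)
    return theta

diverging = run_coordinate_descent(a=2.0, hess_scale=1 / 3, alpha=1.0)
damped = run_coordinate_descent(a=2.0, hess_scale=1 / 3, alpha=2.0)
```

With the second derivative underestimated by a factor of three and `alpha = 1`, each step multiplies the parameter by -2 and the iterates diverge; with `alpha = 2` the multiplier becomes -1/2 and the iterates converge to the minimizer at zero.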

4.4 Blockwise Coordinate Descent for CRFs 

The algorithm described in the previous section is efficient in simple problems (see Section 6.1)
but cannot be used even for moderate size applications of CRFs. For instance, the application to
be considered in Section 6 involves up to millions of parameters and single component coordinate
descent is definitely ruled out in this case. Following [11], we investigate the use of blockwise
updating schemes, which update several parameters simultaneously while trying to share as many
computations as possible. It turns out that the case of CRFs is rather different from the polytomous
logistic regression case considered in [11] and requires specific blocking schemes. In this discussion,
we consider the parameterization defined in (3), which makes it easier to highlight the proposed
block structure.

Examining the forward-backward procedure described in Section 3.2 shows that the evaluation
of the first or second order derivative of the objective function with respect to $\mu_{y,x}$ or $\lambda_{y',y,x}$ requires
computing the pairwise probabilities $P_\theta(y_t = y'', y_{t+1} = y' | x^{(i)})$ for all values of $(y'', y') \in Y^2$ and
for all sequences $x^{(i)}$ which contain the symbol $x$ at any position in the sequence. Hence, the most
natural grouping in this context is to simultaneously update all parameters $\{\mu_{y,x}, \lambda_{y',y,x}\}_{(y',y) \in Y^2}$
that correspond to the same value of $x$. This grouping is orthogonal to the solution adopted for
polytomous regression in [11], where parameters are grouped by common values of the target label.

Algorithm 2 Blockwise Coordinate Descent for CRF

while convergence criterion is not met do
    for $x \in X$ do
        for sequences which contain the symbol $x$ do
            Perform sparse forward-backward on relevant indices.
        end for
        Compute
            $\{\partial l(\mu, \lambda) / \partial \mu_{y,x}, \; \partial^2 l(\mu, \lambda) / \partial \mu_{y,x}^2\}_{y \in Y}$
            $\{\partial l(\mu, \lambda) / \partial \lambda_{y',y,x}, \; \partial^2 l(\mu, \lambda) / \partial \lambda_{y',y,x}^2\}_{(y',y) \in Y^2}$
        using (5) and (16).
        Update $\{\mu_{y,x}\}_{y \in Y}$ and then $\{\lambda_{y',y,x}\}_{(y',y) \in Y^2}$ according to (13)-(14).
    end for
end while



Different variants of this algorithm are possible, including updating only one of the two types
of blocks ($\{\mu_{y,x}\}_{y \in Y}$ or $\{\lambda_{y',y,x}\}_{(y',y) \in Y^2}$) at a time. Although the block coordinate-wise algorithm
requires scanning all the $|X|$ possible symbols $x$ at each iteration, it is usually relatively fast
due to the fact that only those sequences that contain $x$ are considered. In addition, careful
examination reveals that for each sequence that contains the token $x$, it is only required to carry
the forward recursion up to the index of the last occurrence of $x$ in the sequence (and likewise to
perform the backward recursion down to the first occurrence of $x$). The exact computational
saving will however depend on the target application as discussed in Section 6.2 below.
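The bookkeeping needed by the blockwise scheme, namely knowing which sequences contain a given symbol and where its first and last occurrences lie, can be precomputed once with an inverted index. The sketch below is one possible data structure for this purpose; the function name and the use of plain strings as sequences are assumptions made for the illustration.

```python
from collections import defaultdict

def build_symbol_index(sequences):
    """Map each symbol x to {sequence index: (first, last) positions of x}."""
    index = defaultdict(dict)
    for i, seq in enumerate(sequences):
        for t, symbol in enumerate(seq):
            first, _ = index[symbol].get(i, (t, t))   # keep the first position seen
            index[symbol][i] = (first, t)             # the last position is updated as we go
    return index

sequences = ["banana", "cab", "dog"]
index = build_symbol_index(sequences)
# index["a"] lists the sequences to visit when updating the block of symbol "a";
# the stored positions bound the range over which the recursions must be run.
```

When updating the block associated with a symbol, only the sequences listed under that symbol are visited, the forward recursion can stop at the stored last position, and the backward recursion can stop at the stored first position.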

5 Discussion 

As mentioned above, the standard approach for CRFs is based on the use of the $\ell_2$ penalty term
and the objective function is optimized using L-BFGS [18], conjugate gradient or Stochastic
Gradient Descent (SGD) [2]. The CRF training software packages CRF++ [15] and CRFsuite [23] use
L-BFGS while crfsgd [3] is, as the name suggests, based on SGD. The latter approach differs from the
others in that it processes the training sequences one by one: thus each iteration of the algorithm
is very fast and it is generally observed that SGD converges faster to the solution, especially for
large training corpora. On the other hand, as the algorithm approaches convergence, SGD becomes
slower than global quasi-Newton algorithms such as L-BFGS. [31] discusses improvements of the
SGD algorithm based on the use of an adaptive step size whose computation necessitates
second-order information. However, these approaches based on the $\ell_2$ penalty term do not perform feature
selection.

To our knowledge, [20] made the first attempt to perform model selection for conditional 
random fields. The approach was mainly motivated by [7] and is based on a greedy algorithm
which selects features with respect to their impact on the log-likelihood function. Related ideas 
also appear in [Q. These greedy approaches are different from our proposed algorithm in that 
they do not rely on a convex optimization formulation of the learning objective. 

To deal with $\ell_1$ penalties, the simplest idea is that of [13], which was introduced for maximum
entropy models but can be directly applied to conditional random fields. The main idea of [13]
is to split every parameter $\theta$ into two positive constrained parameters, $\theta^+$ and $\theta^-$, such that
$\theta = \theta^+ - \theta^-$. The penalty takes the form $\rho(\theta^+ + \theta^-)$. The optimization procedure is quite
simple, but the number of parameters is doubled and the method is reported to have a slow
convergence [1]. A more efficient approach is the already mentioned orthant-wise quasi-Newton
algorithm introduced in [1]. [12] shows that the orthant-wise optimization procedure is faster
than the algorithm proposed by [13] and performs model selection even in very high-dimensional
problems with no loss of performance compared to the use of the $\ell_2$ penalty term. Orthant-wise
optimization is available in the CRFsuite [23] package. Recently, [3^ proposed an adapted version
of SGD with $\ell_1$ penalization, which is claimed to be much faster than the orthant-wise algorithm.

As observed in [24, 12], $\ell^1$ regularization per se does not in general warrant improved test
set performance. We believe that the real challenge is to come up with methods for CRFs that
can take advantage of the parameter sparsity to either speed up processing or, more importantly,
make it possible to handle larger "virtual" sets of parameters (i.e., a number of parameters that
is potentially very large, with only a very limited fraction of them being selected). The combined
contributions of Sections [3] and [4] are a first step in that direction. Related ideas may be found in [6],
who considers "generalized" feature functions. Rather than making each feature function depend
on a specific value of the label (or on specific values of label pairs), the author introduces functions
that only depend on subsets of (pairs of) labels. This amounts to introducing tying between some
parameter values, a property that can also be used to speed up the forward-backward procedure
during training. This technique considerably reduces the training time, with virtually no
loss in accuracy. From an algorithmic perspective, this work is closest to our approach, since it
relies on a decomposition of the clique potential into two terms: the first has a linear complexity
(w.r.t. the number of labels), and the other is sparse; this idea was already present in [26]. This
method however requires the tying pattern to be specified a priori, a requirement that is not needed
here: the important dependencies emerge from the data, rather than being heuristically selected
a priori. A somewhat extreme position is finally advocated in [17], where the authors propose
to trade the explicit modeling of dependencies between labels for an increase in the number of
features testing the local neighborhood of the current observation token. Our proposal explores the
opposite choice: reducing the number of features to allow for a better modeling of dependencies.

6 Experiments 
6.1 Simulated data 

The experiments on an artificial dataset reported here are meant to illustrate two aspects of the
proposed approach. First, we wish to show that considering unnecessary dependencies in a model
can really hurt the performance, and that using $\ell^2$ and $\ell^1$ regularization terms can help solve this
problem. Second, we wish to demonstrate that the blockwise algorithm (Algorithm [2]) achieves
accuracy results that are very close to those obtained with the coordinate descent approach
(Algorithm [1]).

Figure 2: Performance of the models on artificial data. Models M1-M3 are trained with the $\ell^2$
penalty (L-BFGS), models M4-M8 with the elastic net penalty term (blockwise coordinate descent).
Above: performance on training set. Below: performance on testing set.

The data we use for these experiments are generated with a first-order hidden Markov model. 
Each observation and label sequence has a length of 5, the observation alphabet contains 5 values, 
and the label alphabet contains 6 symbols. This HMM is designed in such a way that the transition 
probabilities are uniform for all label pairs, except for two. The emission probability matrix on 
the other hand has six distinctively dominant entries such that most labels are well identified from 
the observations, except for two of them which are very ambiguous. The minimal (Bayes) error 
for this model is 15.4%. 
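
A generator of this kind can be sketched as follows. Only the dimensions (6 labels, 5 observation values, sequences of length 5) come from the text; the particular label pairs that break uniformity, the probability values `boost` and `dominant`, and the uniform initial distribution are illustrative assumptions, so this sketch does not reproduce the 15.4% Bayes error.

```python
import random

N_LABELS, N_OBS, SEQ_LEN = 6, 5, 5


def make_hmm(boost=0.6, dominant=0.8):
    """Build transition and emission matrices (all numeric values are assumptions)."""
    # Uniform transitions, except for two arbitrarily chosen label pairs.
    trans = [[1.0 / N_LABELS] * N_LABELS for _ in range(N_LABELS)]
    for src, dst in [(0, 1), (2, 3)]:
        row = [(1.0 - boost) / (N_LABELS - 1)] * N_LABELS
        row[dst] = boost
        trans[src] = row
    # One dominant observation per label; labels 4 and 5 share observation 4,
    # which makes these two labels ambiguous.
    emit = []
    for y in range(N_LABELS):
        row = [(1.0 - dominant) / (N_OBS - 1)] * N_OBS
        row[min(y, N_OBS - 1)] = dominant
        emit.append(row)
    return trans, emit


def sample_sequence(trans, emit, rng):
    """Draw one (observations, labels) pair of length SEQ_LEN."""
    obs, labels = [], []
    y = rng.randrange(N_LABELS)  # uniform initial label (assumption)
    for _ in range(SEQ_LEN):
        labels.append(y)
        obs.append(rng.choices(range(N_OBS), weights=emit[y])[0])
        y = rng.choices(range(N_LABELS), weights=trans[y])[0]
    return obs, labels


rng = random.Random(0)
trans, emit = make_hmm()
train = [sample_sequence(trans, emit, rng) for _ in range(10)]
```

With 10 training sequences this yields 50 labelled tokens, the regime considered in the experiment below.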

Figure [2] compares several models: M1 contains both $(y_{t-1}, y_t, x_t)$ and $(y_t, x_t)$ features, M2 and
M3 are simpler, with M2 containing only the bigram features, and M3 only the unigram features.
The models M1-M3 are penalized with the $\ell^2$ norm. Models M4-M8 contain both bigram
and unigram features, but are penalized by the elastic net penalty term. For the $\ell^2$-penalized models
(M1-M3), the regularization factor $\rho_2$ is set to its optimal value (obtained by cross validation). For
M4-M8 however, the value of $\rho_2$ does not influence the performance much and is set to 0.001, while
M4-M8 correspond to different choices of $\rho_1$, as shown in Table [1].

For this experiment, we used only 10 sequences for training, so as to reproduce the
situation, which is prevalent in practical uses of CRFs, where the number of training tokens (here
$10 \times 5 = 50$) is of the same order as the number of parameters, which ranges from $6 \times 5 = 30$
for M3 to $6 \times 5 + 6^2 \times 5 = 210$ for M1 and M4-M8. Figure [2] displays box-and-whiskers plots
summarizing 100 independent replications of the experiment.

                              M4      M5      M6      M7      M8
$\rho_1$                      0.001   0.01    0.1     1       2.5
Number of active $\mu$        28.5    15.0    10.9    6.2     5.8
Number of active $\lambda$    50.6    26      17.2    4.9     1.3

Table 1: Impact of $\rho_1$ on the number of active features ($\rho_2 = 0.001$).
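
The parameter counts quoted for the simulated models follow directly from the sizes of the label and observation alphabets. The short sketch below (helper names are hypothetical) redoes this arithmetic, with one unigram parameter per (label, observation) pair and one bigram parameter per (previous label, label, observation) triple.

```python
def n_unigram_params(n_labels, n_obs):
    """One parameter per (label, observation) pair."""
    return n_labels * n_obs


def n_bigram_params(n_labels, n_obs):
    """One parameter per (previous label, label, observation) triple."""
    return n_labels ** 2 * n_obs


n_labels, n_obs = 6, 5
params_m3 = n_unigram_params(n_labels, n_obs)              # unigram-only model
params_m1 = params_m3 + n_bigram_params(n_labels, n_obs)   # unigram + bigram model
n_train_tokens = 10 * 5                                    # 10 sequences of length 5
print(params_m3, params_m1, n_train_tokens)
```

The full model thus has roughly four times more parameters than there are training tokens, which is why the average numbers of active features in Table 1 are informative.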

Unsurprisingly, M1 and M2, which contain more parameters, perform very well on the training
set, much better than M3. The test performance tells a different story: M2 performs in fact
much worse than the simple unigram model M3, which is all the more remarkable as we know
from the simulation model that the observed tokens are indeed not independent and that the
models are nested (i.e., any model of type M3 corresponds to a model of type M2). Thus, even
with regularization, richer models are not necessarily the best, hence the need for feature selection
techniques. Interestingly, M1, which includes both unigram and bigram features, achieves the
lowest test error, highlighting the interest of using both feature types simultaneously to achieve
some sort of smoothing effect. With a proper choice of the regularization (here, M7), $\ell^1$-penalized
models achieve comparable test set performance. As a side effect of model selection, notice that
M7 is somewhat better than M1 at predicting the test performance at training time: for M1, the
average train error is 6.4% vs. 18.5% for the test error, while for M7 the corresponding figures
are 10.3% and 17.9%, respectively. Finally, closer inspection of the sparsity pattern determined
by M7 shows that it is most often closely related to the structure of the simulation model, which
is also encouraging.

Figure 3: Comparison of the coordinate-wise update with the block update on simulated data.
Above: values of logarithmic loss being minimized. Below: performance on test data.

Figure [3] compares the behavior of the coordinate-wise update policy with the blockwise approach,
where one iteration refers to a complete round in which all model parameters are updated
exactly once. As can be seen on these graphs, the convergence behavior is comparable for both
approaches, both in terms of objective function (top plot) and test error (bottom plot). Each
iteration of the blockwise algorithm is however about 50 times faster than the coordinate-wise
update, a factor that roughly corresponds to the size of each block. Clearly, the blockwise approach is the
only viable strategy when tackling more realistic higher-dimensional tasks such as those considered
in the next two sections.
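
To make the notion of a single coordinate update concrete, the following is a minimal sketch of a generic elastic net coordinate step. It assumes a penalty of the form $\rho_1 |\theta| + (\rho_2 / 2)\,\theta^2$ and a local quadratic model of the loss built from its first derivative `grad_k` and a curvature estimate `curv_k`; these conventions and all names are assumptions made for illustration and are not a transcription of Algorithms [1] or [2].

```python
def soft_threshold(z, rho1):
    """Shrink z towards zero by rho1; values within [-rho1, rho1] map to exactly 0."""
    if z > rho1:
        return z - rho1
    if z < -rho1:
        return z + rho1
    return 0.0


def elastic_net_coordinate_update(theta_k, grad_k, curv_k, rho1, rho2):
    """Minimize grad_k*d + 0.5*curv_k*d**2 + rho1*|theta_k + d| + 0.5*rho2*(theta_k + d)**2
    over d, and return the new coordinate value theta_k + d."""
    return soft_threshold(curv_k * theta_k - grad_k, rho1) / (curv_k + rho2)


def toy_gradient(theta, target):
    """Derivative of the toy loss 0.5 * (theta - target)**2, whose curvature is 1."""
    return theta - target


kept = elastic_net_coordinate_update(0.0, toy_gradient(0.0, 1.0), 1.0, rho1=0.3, rho2=0.1)
zeroed = elastic_net_coordinate_update(0.0, toy_gradient(0.0, 0.2), 1.0, rho1=0.3, rho2=0.1)
print(kept, zeroed)
```

The second call shows how the $\ell^1$ term produces exact zeros: a coordinate whose unpenalized optimum lies within $\rho_1$ of zero is removed from the model rather than merely shrunk.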



6.2 Experiments with Nettalk 

This section presents results obtained on a word phonetization task, using the Nettalk dictionary [25].
This dataset contains approximately 18000 English words and their pronunciations.
Graphemes and phonemes are aligned at the character level thanks to the insertion of a "null
sound" in the latter string when it is shorter than the former. The set of graphemes X thus
includes 26 letters, while the alphabet of phonemes Y contains 53 symbols. In our experiments, we
consider that each phoneme is a target label and we consider two different settings. The first only
uses features that test the value of one single letter, and is intended to allow for a detailed analysis
of the features that are extracted. The second setting is more oriented towards performance and
uses features that also test the neighboring letters. The training set comprises 16452 sequences and
the test set contains 1628 sequences. The results reported here are obtained using the blockwise
version of the coordinate descent approach (Algorithm [2]).

Figure [4] displays the estimated parameter values when the $\ell^1$ penalty is set to its optimal
value of 0.2 (see Table [2] below). Comparing this figure with Figure [U shows that the proposed
algorithm correctly identifies those parameters that are important for the task while setting the
others to zero. It confirms the impression conveyed by Figure [U that only a very limited number
of features is actually used for predicting the labels. The first column in this figure corresponds
to the null sound and is unsurprisingly associated with almost all letters. One can also directly
visualize the ambiguity of the vocalic graphemes, which correspond to the first ('a'), fifth ('e'),
ninth ('i'), etc. gray rows; this contrasts with the much more deterministic association of a typical
consonant grapheme with a single consonant phoneme.

Figure 4: Nettalk task, $\ell^1$ norm of the parameters estimated with elastic net penalty, $\rho_1 = 0.2$,
$\rho_2 = 0.05$. Above: $|\mu_{y,x}|$ for the 53 phonemes y and 26 letters x. Below: $|\lambda_{y',y,x}|$ for the 53
phonemes y and 26 letters x.

Method                    Iter.   Time   Train err. (%)   Test err. (%)   Active $\mu$   Active $\lambda$
SBCD, $\rho_1$ = [...]    30      [...] 14.0 [...]
SBCD, $\rho_1$ = 0.1      30      76     13.5             14.2            1155           4171
SBCD, $\rho_1$ = 0.2      30      70     14               14.2            1089           3598
SBCD, $\rho_1$ = 0.5      30      63     13.7             14.3            957            3077
SBCD, $\rho_1$ = 1        30      55     16.3             16.8            858            3111
SBCD, $\rho_1$ = 2        30      43     16.4             16.9            760            2275
SBCD, $\rho_1$ = 10       30      25     17.3             17.7            267            997
OWL-QN, $\rho_1$ = 0.1    50      165    13.5             14.2            1864           4079
-----------------------------------------------------------------------------------------------
L-BFGS                    90      302    13.5             14.1            74412 (total)
SGD                       30      17     18.5             19.1            74412 (total)

Table 2: Upper part: summary of results for various values of $\rho_1$ for the proposed Sparse Blockwise
Coordinate Descent (SBCD) algorithm (with $\rho_2 = 0.001$) and orthant-wise L-BFGS (OWL-QN).
Lower part: results obtained with $\ell^2$ regularization only, for L-BFGS and stochastic gradient
descent (SGD).

Table [2] gives the per phoneme accuracy with varying level of sparsity, both for the proposed
algorithm (SBCD) and the orthant-wise L-BFGS (OWL-QN) strategy of [1]. For comparison
purposes, the lower part of the table also reports performance obtained with $\ell^2$ regularization only.
For $\ell^2$-based methods (L-BFGS and SGD), the regularization constant was set to its optimal value
determined by cross validation as $\rho_2 = 0.02$. The proposed algorithm (SBCD) is coded in C
while OWL-QN and L-BFGS use the CRF++ package [15] modified to use the liblbfgs library
provided with CRFsuite [23], which implements the standard and orthant-wise modified versions of
L-BFGS. Finally, SGD uses the software of [...]. All running times were measured on a computer
with an Intel Pentium 4 3.00GHz CPU and 2 GB of RAM. Measuring running times is a
difficult issue, as each iteration of the various algorithms does not achieve the same improvement
in terms of performance. For our method, 30 iterations were found necessary to reach reasonable
performance, in the sense that further iterations did not significantly reduce the error rates (with
variations smaller than 0.3%). Proceeding similarly for the other methods showed that OWL-QN
and L-BFGS usually require more iterations to reach stable performance, which is reflected in
Table [2]. Finally, SGD requires few iterations (where an iteration is defined as a complete scan of
all the training sequences), although we obtained disappointing performance on this dataset with
SGD.

First, Table [2] shows that for $\rho_1 = 0.1$ or 0.2 our method reaches an accuracy that is comparable
with that of non-sparse trainers (SBCD with $\rho_1 = 0$ or L-BFGS) but with only about 5000 active
features. Note in particular the dramatic reduction achieved for the bigram features $\lambda_{y',y,x}$, as
the best accuracy/sparsity compromise ($\rho_1 = 0.2$) nullifies about 95% of these parameters. We
observe that the performance of SBCD (for $\rho_1 = 0.1$) is comparable to that of OWL-QN, which is
reassuring as they optimize related criteria, except for the fact that OWL-QN is based on the use
of the sole $\ell^1$ penalty. There are however minor differences in the number of selected features for
both methods. In addition to the slight difference in the penalties used by SBCD and OWL-QN,
it was constantly observed in all our experiments that for $\ell^1$-regularized methods the performance
stabilizes much faster than the pattern of selected features, which may require as much as a few
hundred iterations to fully stabilize. This effect was particularly noticeable with the OWL-QN
algorithm. We have not found a satisfactory explanation for the poor performance of SGD on
this dataset: further iterations do not significantly improve the situation and this failure has not
been observed on the CoNLL 2003 data considered below. In general, SGD is initially very fast
to converge and no other algorithm is able to obtain similar performance with such a small running
time. The fact that SGD fails to reach satisfactory performance in this example is probably
related to an incorrect decrease of the step size. In this regard, an important difference between
the Nettalk data and the CoNLL 2003 example considered below is the number of possible labels,
which is quite high here (53). A final remark regarding timings is that all methods except SBCD
use logarithmic computation in the forward-backward recursions. As discussed in Section [3.2], this
option is slower by a factor which, in our implementation, was measured to be about 2.4. Still,
the SBCD algorithm compares favorably with other algorithms, especially with OWL-QN which
optimizes the same objective function.
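
The difference between the two ways of running the forward recursion can be sketched on a toy chain. The code below is an illustration under stated assumptions, not the implementation used in any of the packages compared: `init` and `mats` are hypothetical non-negative potentials, and rescaling the forward vector at each position is one common way to avoid underflow without working on logarithms. Both routines return the logarithm of the normalization constant; the 2.4 factor reported above is a measured quantity that this sketch does not attempt to reproduce.

```python
import math


def log_z_scaled(init, mats):
    """Forward recursion on plain potentials, renormalizing alpha at each position."""
    scale = sum(init)
    alpha = [a / scale for a in init]
    log_z = math.log(scale)
    for m in mats:
        k = len(alpha)
        new = [sum(alpha[i] * m[i][j] for i in range(k)) for j in range(k)]
        scale = sum(new)
        alpha = [a / scale for a in new]
        log_z += math.log(scale)  # only one logarithm per position
    return log_z


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of values."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_z_logdomain(log_init, log_mats):
    """Same recursion on log-potentials: one exponential per (i, j) pair and position."""
    log_alpha = list(log_init)
    for lm in log_mats:
        k = len(log_alpha)
        log_alpha = [logsumexp([log_alpha[i] + lm[i][j] for i in range(k)])
                     for j in range(k)]
    return logsumexp(log_alpha)


init = [1.0, 2.0, 0.5]
mats = [[[1.0, 0.5, 2.0], [0.1, 1.0, 1.0], [3.0, 1.0, 0.2]]] * 3
log_init = [math.log(v) for v in init]
log_mats = [[[math.log(v) for v in row] for row in m] for m in mats]
z_scaled = log_z_scaled(init, mats)
z_log = log_z_logdomain(log_init, log_mats)
print(z_scaled, z_log)
```

The log-domain variant evaluates an exponential for every pair of labels at every position, which is where its additional cost comes from.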

Table [2] also shows that the running time of SBCD depends on the sparsity of the estimated
model, which is fully attributable to the sparse version of the forward-backward recursion introduced
in Section [3.2]. To make this connection clearer, Figure [5] displays the running time as a
function of the number of active features (rather than $\rho_1$). When the number of active features is
less than 10000, the curve shows a decrease that is proportional to the number of active features
(beware that the x-axis is drawn on a logarithmic scale). The behavior observed for larger numbers
of active features, where the sparse implementation becomes worse than the baseline (horizontal
blue line), can be attributed to the overhead generated by the use of sparse matrix-vector multiplications
for matrices that are not sparse. Hence the ideas presented in Section [3.2] have a strong
potential for reducing the computational burden in situations where the active parameter set is
very small compared to the total number of available features. Note also that the OWL-QN
optimizer could benefit from this idea as well.
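
The link between sparsity and running time can be illustrated with one step of a forward recursion. The sketch below assumes clique potentials of the form $\exp(\mu_{y,x} + \lambda_{y',y,x})$ at a fixed position, and stores only the non-zero bigram parameters in a dictionary; this representation and all names are assumptions made for illustration, and the actual recursions of Section [3.2] may differ in their details. Since $\exp(\lambda) = 1 + (\exp(\lambda) - 1)$, the sum over previous labels splits into a term shared by all labels and a correction that only involves the active bigram parameters.

```python
import math


def forward_step_dense(alpha, mu, lam):
    """Reference step: K*K terms. lam maps (prev, cur) to a non-zero bigram weight."""
    k = len(alpha)
    return [math.exp(mu[j]) * sum(alpha[i] * math.exp(lam.get((i, j), 0.0))
                                  for i in range(k))
            for j in range(k)]


def forward_step_sparse(alpha, mu, lam):
    """Same result in O(K + number of active bigram parameters)."""
    total = sum(alpha)                      # contribution of the "all ones" part
    correction = [0.0] * len(alpha)
    for (i, j), weight in lam.items():      # only active bigram parameters
        correction[j] += alpha[i] * math.expm1(weight)
    return [math.exp(mu[j]) * (total + correction[j]) for j in range(len(alpha))]


alpha = [0.2, 0.5, 0.3]
mu = [0.1, -0.4, 0.7]
lam = {(0, 1): 1.2, (2, 2): -0.8}           # 2 active entries out of 9
dense = forward_step_dense(alpha, mu, lam)
sparse = forward_step_sparse(alpha, mu, lam)
print(dense, sparse)
```

When almost all bigram parameters are active, the dictionary traversal costs more than the plain double loop, which is consistent with the overhead observed for the least sparse models.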

Figure 5: Running time as a function of the number of active features for the SBCD algorithm
on the Nettalk corpus. The blue line corresponds to the running time when using non-sparse
forward-backward.

The simple feature set used in the above experiments is too restricted to achieve state-of-the-art
performance for this task. We therefore conducted another series of experiments with much
larger feature sets, including bias terms and tests on the neighbors of the letter under focus. For
these experiments, we keep the same baseline feature set and add the bias terms $\mathbb{1}\{y_t = y\}$ and
$\mathbb{1}\{y_{t-1} = y', y_t = y\}$ for all possible values of $(y, y')$. For the context, we consider two variants: in
the first, termed pseudo n-gram, we also add $\mathbb{1}\{y_t = y, x_{t \pm i} = x\}$ and $\mathbb{1}\{y_{t-1} = y', y_t = y, x_{t \pm i} = x\}$
for all values of $(x, y, y')$ and of the offset $1 \le i \le (n-1)/2$. In other terms, in this variant, we
test separately the values of the letters that occur before and after the current position. In the
second variant, termed n-gram, we add features $\mathbb{1}\{y_t = y, z_t = z\}$ and $\mathbb{1}\{y_{t-1} = y', y_t = y, z_t = z\}$,
where $z_t$ denotes the letter k-gram centered on $x_t$, and z ranges over all observed k-grams (with
$k \le n$). This second variant seems of course much more computationally demanding [17], as it
yields a much higher number of features (see the top line of Table [3], where the total number of
features is given in millions).
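
The difference between the two context variants is easiest to see on the observation tests they generate at one position. The sketch below is illustrative only: the padding symbol `_` used beyond word boundaries, the restriction to odd values of k from 3 upwards, and the function names are assumptions. Each returned test would then be combined with every label y (and every label pair) to form the actual feature functions.

```python
def pseudo_ngram_tests(word, t, n, pad="_"):
    """Separate tests on each neighbour x_{t-i} and x_{t+i}, for 1 <= i <= (n-1)/2."""
    tests = []
    for i in range(1, (n - 1) // 2 + 1):
        for offset in (-i, i):
            pos = t + offset
            letter = word[pos] if 0 <= pos < len(word) else pad
            tests.append((offset, letter))
    return tests


def ngram_tests(word, t, n, pad="_"):
    """Joint tests on the letter k-grams centred on x_t, for odd k with 3 <= k <= n."""
    tests = []
    for k in range(3, n + 1, 2):
        half = (k - 1) // 2
        gram = "".join(word[p] if 0 <= p < len(word) else pad
                       for p in range(t - half, t + half + 1))
        tests.append((k, gram))
    return tests


pseudo = pseudo_ngram_tests("phone", 2, 5)
joint = ngram_tests("phone", 2, 5)
print(pseudo)
print(joint)
```

A pseudo n-gram test ranges over the letter alphabet at each offset, whereas a true n-gram test ranges over all observed letter combinations, which explains the very different feature counts of the two variants.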



             pseudo n-gram                n-gram
             n = 3         n = 5          n = 3         n = 5
M feat.      0.236         0.399          14.2          121
20 iter.     8.98 (12.5)   6.77 (19.8)    8.22 (12.8)   6.51 (21.7)
30 iter.     8.67 (10.9)   6.65 (17.1)    8.04 (11.7)   6.50 (20.1)

Table 3: Experiments with contextual features: performance of the SBCD algorithm after 20
and 30 iterations in terms of error rate and, between parentheses, number of selected features (in
thousands).

As can be seen in Table [3], these extended feature sets yield results that compare favorably with
those reported in [33, 19], with a phoneme error rate of 6.5% for the 5-gram system. We also
find that even though both variants extract comparable numbers of features, the results achieved
with "true" n-grams are systematically better than for the pseudo n-grams. More interestingly,
the n-gram variants are also faster: this paradoxical observation is due to the fact that for the
n-gram features, each block update only visits a very small number of observation sequences,
and further, that for each position, a much smaller number of features are active as compared
to the pseudo n-gram case. Finally, the analysis of the performance achieved after 20 and 30
iterations suggests that the n-gram systems are quicker to reach their optimal performance. This
is because a very large proportion of the n-gram features are zeroed in the first few iterations. For
instance, the 5-gram model, after only 9 iterations, has an error rate of 6.56% and selects only
27.3 thousand features out of 121 million. These results clearly demonstrate the computational
reward of exploiting the sparsity of the parameter set as described in Section [3]; in fact, training
our largest model takes less than 5 hours (for 20 iterations), which is quite remarkable given the
very high number of features.

6.3 Experiments with CoNLL 2003 

Named entity recognition consists in extracting groups of syntagms that correspond to named
entities (e.g., names of persons, organizations, places, etc.). The data used for our experiments
are taken from the CoNLL 2003 challenge [29] and involve four distinct types of named entities,
and 8 labels. Labels have the form B-X or I-X, that is, begin or inside of a named entity X
(however, the label B-PER is not present in the corpus). Words that are not included in any
named entity are labeled with O (outside). The train set contains 14987 sequences, and the test
set 3684 sequences. At each position in the text, the input consists of three separate components
corresponding respectively to the word (with 30290 distinct words in the corpus), part-of-speech
(46), and syntactic (18) tags. To accommodate this multidimensional input, the standard practice
consists in superposing unigram or bigram features corresponding to each of those three dimensions
considered separately. The parameters we use in the model are of the form $\{\mu_{y,x^d}, \lambda_{y',y,x^d}\}$ for
$d \in \{1,2,3\}$, which corresponds to a little more than 9 million parameters. Hence, the necessity
to perform model selection is acute.

Figure 6: Performance comparison of different model selection approaches on CoNLL 2003 (English):
test set B.

To illustrate the efficiency of $\ell^1$-based feature selection, we compare it with three simple-minded
approaches to feature selection, which are often used in practice. The first one, termed
"cut-off", consists in incorporating only those features that have been observed sufficiently often
in the training corpus. This amounts to deleting a priori all the rare dependencies. The second
strategy, termed MI preselection, selects features based on their Mutual Information (MI) with the
label, as in [32] (where the MI is referred to as the information gain). The third option consists
in training a model that is not sparse (e.g., with an $\ell^2$ penalty term) and in removing a posteriori
all features whose values are not of sufficient magnitude. Figure [6] compares the error rates
obtained with these strategies to those achieved by the SBCD or OWL-QN algorithms. Obviously,
a priori cut-off strategies imply some performance degradation, although MI preselection clearly
dominates frequency-based selection. A posteriori thresholding is more efficient but cannot be
used to obtain well-performing models that are very sparse (here, with fewer than 10000 features).
From a computational point of view, a posteriori thresholding is also penalized by the need to
estimate a very large model that contains all available features.
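
The three baseline strategies can be sketched on a toy list of (observation, label) pairs. The code below is a simplified illustration: the exact selection criteria used in the experiments (count thresholds, the precise MI definition of [32], magnitude thresholds) are not specified here, so the function names, the binary-test form of the MI and the toy data are all assumptions.

```python
import math
from collections import Counter


def cutoff_selection(pairs, min_count):
    """A priori cut-off: keep (observation, label) features seen at least min_count times."""
    counts = Counter(pairs)
    return {feature for feature, c in counts.items() if c >= min_count}


def mutual_information(pairs, x_value):
    """MI (in nats) between the binary test [x == x_value] and the label."""
    n = len(pairs)
    joint = Counter((x == x_value, y) for x, y in pairs)
    test_counts = Counter(x == x_value for x, _ in pairs)
    label_counts = Counter(y for _, y in pairs)
    return sum((c / n) * math.log(c * n / (test_counts[b] * label_counts[y]))
               for (b, y), c in joint.items())


def threshold_selection(weights, min_abs):
    """A posteriori thresholding: keep features of a trained model with large weights."""
    return {feature for feature, w in weights.items() if abs(w) >= min_abs}


pairs = [("Paris", "LOC"), ("Paris", "LOC"), ("the", "O"),
         ("the", "O"), ("the", "O"), ("Smith", "PER")]
frequent = cutoff_selection(pairs, min_count=2)
mi_paris = mutual_information(pairs, "Paris")
large = threshold_selection({("Paris", "LOC"): 1.3, ("the", "O"): 0.02}, min_abs=0.1)
print(frequent, mi_paris, large)
```

Note that the third strategy takes the weights of an already trained dense model as input, which is the computational drawback pointed out in the text.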

In this experiment, SBCD proves computationally less efficient, for at least two reasons. First, 
many POS or chunk tags appear in all training sequences; the same is true for very frequent words: 
this yields many very large blocks containing almost all training sentences. Second, the sparse 
forward-backward implementation is less efficient than in the case of the phonetization task as the 
number of labels is much smaller: SBCD needs 42 minutes (with $\rho_1 = 1$, corresponding to 6656
active features) to achieve a reasonable performance while OWL-QN is faster, taking about 5
minutes to converge. If sparsity is not needed, SGD appears to be the most efficient method for 
this corpus as it converges in less than 4 minutes. L-BFGS in contrast requires about 25 minutes 
to reach a similar performance. 

7 Conclusions 

In this paper, we have proposed an algorithm that performs feature selection for CRFs. The
benefits of working with sparse feature vectors are twofold: obviously, fewer features need to be
computed and stored; more importantly, sparsity can be used to speed up the forward-backward
and the Viterbi algorithms. Our method combines the $\ell^2$ and $\ell^1$ penalty terms and can thus be
viewed as an extension of the elastic net concept to CRFs. To make the method tractable,
we have developed a sparse version of the forward-backward recursions; we have also introduced and
validated two novel ideas: an approximation of the second order derivative of the CRF logarithmic
loss as well as an efficient parameter blocking scheme. This method has been tested on artificial
and real-world data, using large to very large feature sets containing more than one hundred
million features, and yields accuracy that is comparable with conventional training algorithms,
with much sparser parameter vectors.

The results obtained in this study open several avenues that we wish to explore in the future.
A first extension of this work is related to finding the optimal weight for the penalization term,
a task that is usually achieved through heuristic search for the value(s) that will deliver the best
performance on a development set. Based on our experiments, this search can be performed
efficiently using pseudo regularization-path techniques, which amount here to starting the training
with a very constrained model, and progressively reducing the weight of the $\ell^1$ term so as to
increase the number of active features. This can be performed effectively at very little cost by
restarting the blockwise optimization from the parameter values obtained with the previous weights
setting (so-called "warm starts"), thereby greatly reducing the number of iterations needed to reach
convergence. A second interesting perspective, aiming at improving the training speed, is based
on the observation that after a dozen iterations or so, the number of active features is decreasing
steadily. This suggests that those features that are inactive at that stage will remain inactive
until the convergence of the procedure. Hence, in some situations, limiting the updates to the
features that are currently active can be an efficient way of improving the training speed. Finally,
the sparse forward-backward implementation appears to be most attractive when the number of
labels is very large. Hence, extensions of this idea to cases where the features include, for instance,
conjunctions of tests that operate on more than two successive labels are certainly feasible. The
perspective here consists in taking advantage of the sparsity to allow for inclusion of longer range label
dependencies in CRFs.